Emotional Voice Conversion with Adaptive Scales F0 Based on Wavelet Transform Using Limited Amount of Emotional Data
نویسندگان
چکیده
Deep learning techniques have been successfully applied to speech processing. Typically, neural networks (NNs) are very effective in processing nonlinear features, such as mel cepstral coefficients (MCC), which represent the spectrum features in voice conversion (VC) tasks. Despite these successes, the approach is restricted to problems with moderate dimension and sufficient data. Thus, in emotional VC tasks, it is hard to deal with a simple representation of fundamental frequency (F0), which is the most important feature in emotional voice representation, Another problem is that there are insufficient emotional data for training. To deal with these two problems, in this paper, we propose the adaptive scales continuous wavelet transform (AS-CWT) method to systematically capture the F0 features of different temporal scales, which can represent different prosodic levels ranging from micro-prosody to sentence levels. Meanwhile, we also use the pre-trained conversion functions obtained from other emotional datasets to synthesize new emotional data as additional training samples for target emotional voice conversion. Experimental results indicate that our proposed method achieves the best performance in both objective and subjective evaluations.
منابع مشابه
Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform
An artificial neural network is one of the most important models for training features of voice conversion (VC) tasks. Typically, neural networks (NNs) are very effective in processing nonlinear features, such as mel cepstral coefficients (MCC) which represent the spectrum features. However, a simple representation for fundamental frequency (F0) is not enough for neural networks to deal with an...
متن کاملEmotional voice conversion using neural networks with arbitrary scales F0 based on wavelet transform
An artificial neural network is an important model for training features of voice conversion (VC) tasks. Typically, neural networks (NNs) are very effective in processing nonlinear features, such as Mel Cepstral Coefficients (MCC), which represent the spectrum features. However, a simple representation of fundamental frequency (F0) is not enough for NNs to deal with emotional voice VC. This is ...
متن کاملDeep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion
Emotional voice conversion aims at converting speech from one emotion state to another. This paper proposes to model timbre and prosody features using a deep bidirectional long shortterm memory (DBLSTM) for emotional voice conversion. A continuous wavelet transform (CWT) representation of fundamental frequency (F0) and energy contour are used for prosody modeling. Specifically, we use CWT to de...
متن کاملAn Emotion Recognition Approach based on Wavelet Transform and Second-Order Difference Plot of ECG
Emotion, as a psychophysiological state, plays an important role in human communications and daily life. Emotion studies related to the physiological signals are recently the subject of many researches. In This study a hybrid feature based approach was proposed to examine affective states. To this effect, Electrocardiogram (ECG) signals of 47 students were recorded using pictorial emotion elici...
متن کاملHierarchical modeling of F0 contours for voice conversion
Voice conversion systems deal with the conversion of a speech signal to sound as if it was uttered by another speaker. The conversion of the spectral features has attracted a lot of research attention but the conversion of pitch, modeling the speakerdependent prosody, is often achieved by just controlling the F0 level and range. However, the detailed prosody, including different linguistic unit...
متن کامل